Team members

Leon Harper (21385662)

Thomas Newton (21365654)

Michal Jedruszczak (21440496)

These are the team members for the group project.

Step 1: Problem statement

World of Warcraft Problem Statement

World of Warcraft is a popular MMORPG (massively multiplayer online role-playing game) with millions of users every month. With this many players, some cyberbullying and toxicity is inevitable; World of Warcraft, however, is especially toxic and is often ranked as having one of the most toxic gaming communities today. A 2021 survey by the ADL (Anti-Defamation League) found that 66% of adults aged 18-45 had been harassed or bullied in World of Warcraft (Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021).

The objective of this project is to create a model that can detect cyberbullying/toxicity. The World of Warcraft dataset provided to us gives the model reference examples of comments deemed to be bullying.

This will be a classification model that, when given a comment/statement, decides whether it is bullying or not bullying. It will do this by detecting certain words and phrases.

Currently, most video games have an option to filter chat, but this only censors certain words/phrases. Our model will be able to detect strings of words rather than just individual ones.

League of Legends problem statement

League of Legends is a popular MOBA (multiplayer online battle arena) video game with millions of users every month. With this many players, some cyberbullying and toxicity is inevitable; League, however, is especially toxic and is often ranked as having one of the most toxic gaming communities today. The ban rate across all accounts as of September 2022 is 2.25% (around 2,632,500 accounts). A 2021 survey by the ADL (Anti-Defamation League) found that 65% of adults aged 18-45 had been harassed or bullied in League of Legends.

The objective of this project is to create a model that can detect cyberbullying/toxicity. The League of Legends dataset provided to us gives the model reference examples of comments deemed to be bullying.

This will be a classification model that, when given a comment/statement, decides whether it is bullying or not bullying. It will do this by detecting certain words and phrases.

Currently, most video games have an option to filter chat, but this only censors certain words/phrases. Our model will be able to detect strings of words rather than just individual ones.

ADL Survey: Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021 (adl.org)

Step 2: Importing data

wow_posts_df <- read.csv("Data/posts_wow.csv")
wow_annotations_df <- read.csv("Data/annotations_wow.csv")

lol_posts_df <- read.csv("Data/posts_lol.csv")
lol_annotations_df <- read.csv("Data/annotations_lol.csv")

This imports the required data for the project. The data was exported from an SQL script that creates the necessary tables (i.e. posts and annotations) and inserts the data. To simplify importing, we exported the SQL table data into csv files using custom SQL queries, as the MySQL Workbench Table Export Wizard doesn’t export all of the data properly.

Step 2.5: Preliminary EDA

We are doing a preliminary EDA in order to understand how we should clean the data and the kind of data that we are dealing with.

League of Legends analysis

In our League of Legends posts dataset, we are given 16867 rows of 5 features: topic_id, post_number, author, html_message and timestamp. For the purposes of this project, we will ignore timestamp, as it will not be used to train our models. The additional LoL annotations dataset reveals which messages were flagged as cyberbullying, from which the only useful features are the post_number and topic_id. This information has been combined into a single file – clean_posts_balanced_sample.csv.

Posts in the main dataset are formatted using HTML, meaning that our model will either have to be trained to recognise patterns such as paragraph breaks or we will have to clean the data and transform it into something more appropriate.

Furthermore, as we are only using a single set of features (post_number and topic_id) from the annotations dataset, we may be able to add another column to the posts dataset – a Boolean value representing whether or not a specific post contains cyberbullying. This will act as the target variable for our model building process.
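The idea behind this target column can be sketched with toy data: a post counts as bullying if its (topic_id, post_number) pair appears in the annotations table. The data frames below are illustrative only, not the real datasets.

```r
# Toy illustration: flag posts whose (topic_id, post_number) appears in annotations
posts <- data.frame(topic_id = c(1, 1, 2),
                    post_number = c(0, 1, 0),
                    html_message = c("gg wp", "uninstall already", "nice play"))
annotations <- data.frame(topic_id = 1, post_number = 1)  # post 1 of topic 1 was flagged

# Build a composite key so membership can be tested on a single column
key <- function(df) paste(df$topic_id, df$post_number, sep = "_")
posts$is_bullying <- as.integer(key(posts) %in% key(annotations))
posts$is_bullying  # 0 1 0
```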

Regarding the types of data we’re given, topic IDs, authors and HTML messages are categorical, while post numbers are ordinal. In the annotations dataset, all values but post number are categorical.

summary(lol_posts_df)
##     topic_id       post_number      author          html_message      
##  Min.   :   2.0   Min.   :   0   Length:16867       Length:16867      
##  1st Qu.:  33.0   1st Qu.: 364   Class :character   Class :character  
##  Median : 140.0   Median : 953   Mode  :character   Mode  :character  
##  Mean   : 604.6   Mean   :1393                                        
##  3rd Qu.:1090.0   3rd Qu.:1998                                        
##  Max.   :2481.0   Max.   :5358                                        
##   timestamp        
##  Length:16867      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
nrow(lol_posts_df)
## [1] 16867

There are 16867 data points in our dataset.

lol_posts_df %>% select("topic_id") -> lol_topics #gets all unique lol_topics from the dataset
nrow(unique(lol_topics))
## [1] 17

This data set has a total of 17 different topics.

annotated_messages_lol <- merge(x = lol_posts_df, y = lol_annotations_df, by = c("topic_id","post_number"), all = TRUE) #outer join lol_posts_df and lol_annotations_df

annotated_messages_lol %>% filter(annotator %in% c("1","2","3")) -> annotated_messages_lol

annotated_messages_lol[!duplicated(annotated_messages_lol[3]),] -> annotated_messages_lol

annotated_messages_lol %>% drop_na() -> annotated_messages_lol
annotated_messages_lol

This code shows all annotated messages (messages that have been marked as cyberbullying). This is accomplished by performing an outer join between lol_posts_df and lol_annotations_df, keeping only the messages that have an annotation, removing duplicate messages and finally dropping null rows.

100 * count(annotated_messages_lol)/count(lol_posts_df)#percentage of posts marked as cyberbullying

For this data set 0.9% of the posts are labeled as bullying. To make the data less skewed, we need to take a balanced sample of bullying and non-bullying messages.
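As a quick sketch of what that balancing will look like (toy labels, not the real data), undersampling keeps all of the minority class and samples an equal number from the majority class:

```r
# Toy undersampling: sample the majority class down to the minority class size
set.seed(42)
labels <- c(rep(0, 97), rep(1, 3))              # ~3% positive, like our skewed data
minority <- which(labels == 1)
majority <- sample(which(labels == 0), length(minority))
balanced <- labels[c(majority, minority)]
table(balanced)  # 3 zeros and 3 ones
```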

lol_annotations_df %>% select("annotator") -> lol_ann #gets all unique annotators from the annotations dataset
nrow(unique(lol_ann))
## [1] 3

We can see that there are three unique annotators for this dataset.

lol_annotations_df %>% select("offender") -> lol_off #gets all unique offenders from the annotations dataset
nrow(unique(lol_off))
## [1] 152

152 unique offenders have been identified by the annotators.

World of Warcraft analysis

In our World of Warcraft posts dataset, we are given 16978 rows of 5 features: topic_id, post_number, author, html_message and timestamp. For the purposes of this project, we will ignore timestamp, as it will not be used to train our models. The additional WoW annotations dataset reveals which messages were flagged as cyberbullying, from which the only useful features are the post_number and topic_id. This information has been combined into a single file – clean_posts_balanced_sample.csv.

Posts in the main dataset are formatted using HTML, meaning that our model will either have to be trained to recognise patterns such as paragraph breaks or we will have to clean the data and transform it into something more appropriate.

Furthermore, as we are only using a single set of features (post_number and topic_id) from the annotations dataset, we may be able to add another column to the posts dataset – a Boolean value representing whether or not a specific post contains cyberbullying. This will eliminate the need for the use of two separate datasets to develop our models.

Regarding the types of data we’re given, topic IDs, authors and HTML messages are categorical, while post numbers are ordinal. In the annotations dataset, all values but post number are categorical.

summary(wow_posts_df)
##    topic_id          post_number      author          html_message      
##  Length:16978       Min.   :   0   Length:16978       Length:16978      
##  Class :character   1st Qu.: 226   Class :character   Class :character  
##  Mode  :character   Median : 708   Mode  :character   Mode  :character  
##                     Mean   :1366                                        
##                     3rd Qu.:2420                                        
##                     Max.   :4692                                        
##                     NA's   :3                                           
##   timestamp        
##  Length:16978      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
nrow(wow_posts_df)
## [1] 16978

There are 16978 data points in our dataset.

wow_posts_df %>% select("topic_id") -> wow_topics #gets all unique wow_topics from the dataset
nrow(unique(wow_topics))
## [1] 23

This data set has a total of 23 different topics.

annotated_messages_wow <- merge(x = wow_posts_df, y = wow_annotations_df, by = c("topic_id","post_number"), all = TRUE) #outer join wow_posts_df and wow_annotations_df

annotated_messages_wow %>% filter(annotator %in% c("1","2","3")) -> annotated_messages_wow

annotated_messages_wow[!duplicated(annotated_messages_wow[3]),] -> annotated_messages_wow

annotated_messages_wow %>% drop_na() -> annotated_messages_wow
annotated_messages_wow

This code shows all annotated messages (messages that have been marked as cyberbullying). This is accomplished by performing an outer join between wow_posts_df and wow_annotations_df, keeping only the messages that have an annotation, removing duplicate messages and finally dropping null rows.

100 * count(annotated_messages_wow)/count(wow_posts_df)#percentage of posts marked as cyberbullying

For this data set 0.91% of the posts are labeled as bullying. To make the data less skewed, we need to take a balanced sample of bullying and non-bullying messages.

wow_annotations_df %>% select("annotator") -> wow_ann #gets all unique annotators from the annotations dataset
nrow(unique(wow_ann))
## [1] 3

We can see that there are three unique annotators for this dataset.

wow_annotations_df %>% select("offender") -> wow_off #gets all unique offenders from the annotations dataset
nrow(unique(wow_off))
## [1] 102

102 unique offenders have been identified by the annotators.

Step 3: Cleaning the data / Pre-processing

For pre-processing, we decided to merge the datasets, as they are essentially identical in structure; we believed this would simplify pre-processing and remove code duplication. We also ignored the annotation csv files (apart from feature extraction), as we found there wasn’t anything to clean (we would just be cleaning metadata). Once the datasets were merged, we created an “is_bullying” column to be used as the target variable for the model building process. We also created a “bullying_severity” column, as each cyberbullying post can be flagged by multiple annotators.

After that, we extracted the messages from the HTML and pre-processed them through stemming, removing punctuation and removing stopwords (the messages are HTML as both datasets come from web forums). We then derived a “word_counts” column for EDA purposes and sampled the data to balance it. Next, we created training and test datasets for model building, where each dataset is a document-term matrix, as we cannot pass raw text data to the models. Finally, we export the relevant data (including the clean data).

Merging data

# Creates dataset column to merge posts and annotations csv files together
wow_posts_df$dataset <- "WoW"
lol_posts_df$dataset <- "LoL"
wow_annotations_df$dataset <- "WoW"
lol_annotations_df$dataset <- "LoL"

posts_df <- rbind(wow_posts_df, lol_posts_df)
annotations_df <- rbind(lol_annotations_df, wow_annotations_df)

Since wow_posts_df and lol_posts_df have the same structure, we merged the posts and annotation data frames to simplify pre-processing (this avoids repeating code). However, we still need to analyse the datasets separately for EDA purposes, so we created a “dataset” feature to keep them distinguishable.

posts_df$id <- paste(posts_df$dataset, posts_df$topic_id, posts_df$post_number, sep="_")
annotations_df$id <- paste(annotations_df$dataset, annotations_df$topic_id, annotations_df$post_number, sep="_")

To simplify the merging of data frames, we will create an ID column so that a left join can be performed on a single column. This mitigates the issues of duplicate topic ids and post numbers as the post numbers are only unique according to the topic id.

merged_df <- left_join(posts_df, annotations_df, by = "id", keep=TRUE)
merged_df$is_bullying <- as.integer(!is.na(merged_df["id.y"]))
drop <- c('topic_id.y', 'post_number.y', 'dataset.y', 'id.y', 'offender', 'victim')
merged_df <- merged_df[, !(names(merged_df) %in% drop)]

# Removes the ".x" characters from the remaining annotations columns
colnames(merged_df) = sub(".x", "", colnames(merged_df))

# Create bullying_severity column
names(merged_df)[names(merged_df) == "annotator"] <- "bullying_severity"
merged_df["bullying_severity"][is.na(merged_df["bullying_severity"])] <- 0
posts_df <- merged_df %>% group_by(id) %>% slice(which.max(bullying_severity))

This code performs a left join to merge the dataframes together. Most of the columns from the annotations dataframe are not useful for training an NLP classifier, so we remove them. Since the join produces duplicate columns on each side, we drop the “.y” columns.

We also created a bullying_severity column as we found that some posts have been annotated as bullying by multiple annotators which could make this a useful feature for model building.

Removing HTML + pre-processing

remove_html <- function(html_msg, isHtml) {
  if(isHtml) {
    # Remove backslashes when dealing with LoL forum data
    html_msg <- gsub("\\\\", '', html_msg)
    # Get XML nodes
    msg <- xml2::read_html(html_msg)
    # Get the block quotes and quotes (blockquotes for WoW, .quote for LoL)
    blockquotes <- msg %>% html_nodes("blockquote")
    quotes <- msg %>% html_nodes(".quote")
    
    # Remove quote elements for LoL and WoW datasets
    xml_remove(blockquotes)
    xml_remove(quotes)
    msg <- html_text(msg)
    return(msg)
  }
  return(html_msg)
}

The “html_message” column contains a mix of messages that do and do not contain HTML. To handle this, we create an “is_html” column that uses a regular expression to detect HTML, in order to prevent errors with rvest. The “tm” package does not handle removing HTML content, and we cannot simply use a regular expression to strip the HTML, as the data originates from gaming forums where quote elements (“blockquote” for WoW, elements with the “.quote” class for LoL) are frequently used: a regular expression would remove the tags but leave the quoted content behind.

To remove the quoted content, we use rvest to acquire the “blockquote” elements as well as any elements with the “.quote” class, then use xml_remove() to remove those nodes. We then convert the rvest object back into a string.
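A small base-R check of why tag-stripping alone is insufficient (toy message, not from the dataset): removing the tags with a regex leaves the quoted text in place.

```r
# Stripping tags with a regex keeps the blockquote's content
msg <- "<p>fine reply</p><blockquote>quoted insult</blockquote>"
gsub("<.*?>", "", msg)  # "fine replyquoted insult" - the quote survives
```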

# Regex for detecting HTML
detect_html_regex <- "<.*?>"
# Create is_html column
posts_df$is_html <- str_detect(posts_df$html_message, detect_html_regex)
# Apply remove_html function to html_message
posts_df$html_message <- mapply(remove_html, posts_df$html_message, posts_df$is_html)
posts_df <- posts_df[, !(names(posts_df) %in% 'is_html')]

# Converts any text matching the regex passed into the transformer into a space character
toSpaceTransformer <- content_transformer(function (x, pattern) gsub(pattern, " ", x))
posts_corpus <- Corpus(VectorSource(posts_df$html_message))
posts_corpus <- posts_corpus %>% 
                tm_map(content_transformer(tolower)) %>%
                tm_map(toSpaceTransformer, "http\\S+\\s*") %>%
                tm_map(removeNumbers) %>%
                tm_map(removeWords, stopwords("english")) %>%
                tm_map(removePunctuation) %>%
                tm_map(stemDocument) %>%
                tm_map(stripWhitespace)
posts_df$html_message <- data.frame(text=sapply(posts_corpus, identity), stringsAsFactors = F)$text

This code removes useless characters, stopwords, punctuation and it uses stemming to improve model performance. Certain steps of the pre-processing could be tweaked to improve model performance (e.g. number of stopwords being omitted) as the pre-processing could end up being too rigorous.

We removed the HTML characters first in order to prevent interference when removing punctuation or whitespace.

Word counts

posts_df$word_counts <- str_count(posts_df$html_message, "\\S+")

This code gets the word counts for the html messages which can be used for analysing word counts in the EDA. We may also use the word counts to filter messages with word counts that are too low.
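A minimal sketch of such a filter, with an illustrative threshold of three words (the threshold is hypothetical, not something we have committed to):

```r
# Count words per message and keep messages with at least 3 words
msgs <- c("gg", "you played really badly today", "nice one mate")
word_counts <- lengths(gregexpr("\\S+", msgs))
word_counts            # 1 5 3
msgs[word_counts >= 3] # drops the one-word message
```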

posts_df <- posts_df %>% na_if("") %>% na.omit

This code removes rows with empty or missing messages from posts_df; these can appear after pre-processing if a message consisted mostly of stop words.

write.csv(posts_df, file="Data/clean_posts.csv")

This code exports the clean posts to a csv file to be analysed separately. This also comes in handy in order to save time when performing EDAs as pre-processing can take time (especially on slow computers).

Creating DTM

DTM for combined dataset

corpus = VCorpus(VectorSource(posts_df$html_message))
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
posts_data = as.data.frame(as.matrix(dtm))
posts_data$is_bullying = as.factor(posts_df$is_bullying)

DTM for separate datasets

lol_posts <- posts_df %>% filter(dataset=="LoL")
wow_posts <- posts_df %>% filter(dataset=="WoW")

lol_corpus = VCorpus(VectorSource(lol_posts$html_message))
lol_dtm = DocumentTermMatrix(lol_corpus)
lol_dtm = removeSparseTerms(lol_dtm, 0.999)
lol_posts_data = as.data.frame(as.matrix(lol_dtm))
lol_posts_data$is_bullying = as.factor(lol_posts$is_bullying)

wow_corpus = VCorpus(VectorSource(wow_posts$html_message))
wow_dtm = DocumentTermMatrix(wow_corpus)
wow_dtm = removeSparseTerms(wow_dtm, 0.999)
wow_posts_data = as.data.frame(as.matrix(wow_dtm))
wow_posts_data$is_bullying = as.factor(wow_posts$is_bullying)

We create a document-term matrix from the html messages and remove overly sparse terms (terms that appear in almost no documents) using removeSparseTerms. We then assign an “is_bullying” column for model building and EDA purposes.
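To make the structure concrete, here is a tiny document-term matrix built by hand in base R (the real one comes from tm’s DocumentTermMatrix): one row per message, one column per term, with cells holding term counts.

```r
# Hand-built toy document-term matrix: rows = documents, columns = terms
docs <- c("noob team", "nice team play", "team team")
terms <- sort(unique(unlist(strsplit(docs, " "))))
dtm <- t(sapply(strsplit(docs, " "),
                function(words) table(factor(words, levels = terms))))
dtm  # e.g. the third document has "team" counted twice
```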

ggplot(data=posts_data, aes(x=is_bullying)) + geom_bar()

ggplot(data=wow_posts_data, aes(x=is_bullying)) + geom_bar()

ggplot(data=lol_posts_data, aes(x=is_bullying)) + geom_bar()

As we can see, the data is heavily imbalanced: there are very few bullying cases. Training on this data would result in a classifier that is more accurate at classifying non-bullying cases than bullying cases. We will use undersampling, because we have plenty of non-bullying data but not enough data for bullying cases (this means we can afford to reduce how much data we are dealing with).

Sampling the data

League of Legends dataset

is_bullying = which(lol_posts_data$is_bullying == 1)
not_bullying = which(lol_posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
lol_posts_data_balanced = lol_posts_data[c(sample_bullying, sample_not_bullying),]

World of Warcraft dataset

is_bullying = which(wow_posts_data$is_bullying == 1)
not_bullying = which(wow_posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
wow_posts_data_balanced = wow_posts_data[c(sample_bullying, sample_not_bullying),]

Combined dataset

is_bullying = which(posts_data$is_bullying == 1)
not_bullying = which(posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
posts_data_balanced = posts_data[c(sample_bullying, sample_not_bullying),]

This creates a balanced sample of the bullying and non-bullying data. However, this comes at the expense of having much less data to work with, as there are significantly fewer cyberbullying cases than non-cyberbullying cases.

ggplot(data=posts_data_balanced, aes(x=is_bullying)) + geom_bar()

ggplot(data=wow_posts_data_balanced, aes(x=is_bullying)) + geom_bar()

ggplot(data=lol_posts_data_balanced, aes(x=is_bullying)) + geom_bar()

Train/test split

set.seed(42)
part <- sample(2, nrow(posts_data), replace=TRUE, prob=c(0.6, 0.4))
train <- posts_data[part == 1, ]
test <- posts_data[part == 2, ]

We split the data using a 60:40 train-test split.

Combined data

set.seed(42)
part <- sample(2, nrow(posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
train_balanced <- posts_data_balanced[part == 1, ]
test_balanced <- posts_data_balanced[part == 2, ]

We split the data using a 60:40 split. This is for the balanced data.

League of Legends dataset

set.seed(42)
part <- sample(2, nrow(lol_posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
lol_train_balanced <- lol_posts_data_balanced[part == 1, ]
lol_test_balanced <- lol_posts_data_balanced[part == 2, ]

World of Warcraft dataset

set.seed(42)
part <- sample(2, nrow(wow_posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
wow_train_balanced <- wow_posts_data_balanced[part == 1, ]
wow_test_balanced <- wow_posts_data_balanced[part == 2, ]

Exporting cleaned data

write.csv(posts_data_balanced, file="Data/Balanced/clean_posts_dtm_balanced_sample.csv")
write.csv(wow_train_balanced, file="Data/Balanced/clean_wow_train_balanced_sample.csv")
write.csv(wow_test_balanced, file="Data/Balanced/clean_wow_test_balanced_sample.csv")
write.csv(lol_train_balanced, file="Data/Balanced/clean_lol_train_balanced_sample.csv")
write.csv(lol_test_balanced, file="Data/Balanced/clean_lol_test_balanced_sample.csv")
write.csv(train_balanced, file="Data/Balanced/train_balanced.csv")
write.csv(test_balanced, file="Data/Balanced/test_balanced.csv")
write.csv(train, file="Data/train.csv")
write.csv(test, file="Data/test.csv")
write.csv(posts_data, file="Data/clean_posts_dtm.csv")

We export the training and test data to make steps such as model building and EDA easier and more convenient as it can take time to pre-process the data (especially on slower computers).

Step 4: EDA

League of Legends analysis

posts_df %>% filter(dataset == "LoL") -> lol_posts
lol_posts %>% filter(is_bullying == 1) -> bullying_lol

Here we filter our LoL data and, from that data, select all the messages that have been marked as cyberbullying. This allows us to calculate the percentage of cyberbullying cases in our dataset.

nrow(bullying_lol) / nrow(lol_posts) * 100
## [1] 1.530429

For this dataset 1.53% of the posts are labeled as bullying.

lol_tibble <- tibble(txt = lol_posts$html_message)
lol_tibble #transforming the HTML messages into a tibble for an easier workflow
lol_tibble <- lol_tibble%>% 
    mutate(linenumber = row_number()) %>%
 unnest_tokens(word, txt) %>% anti_join(stop_words)
lol_tibble #splitting tibble by words
lol_counts <- lol_tibble %>% count(word, sort=TRUE)
lol_counts #sorting words by count
wordcloud(lol_counts$word, lol_counts$n, max.words = 250,
          min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

  #creating a word cloud out of sorted list

By using a word cloud to analyse the HTML messages from the League of Legends boards, we have gained several insights into the issue of cyberbullying within the game.

The word cloud helped us to identify the most common words or phrases used in the chat. This allowed us to see whether certain language or terminology was frequently used in a negative or bullying context. For example, the words “stupid” and “idiot” were often marked as cyberbullying, which could indicate that players were using those words to mock or belittle others.

lol_bigrams <- tibble(txt = lol_posts$html_message) %>% 
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
lol_bigrams
#split original tibble into two-word bigrams

Here we split the LoL tibble into bigram tokens. This allows us to perform an n-gram analysis and see the most common relationships between words.
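Under the hood, bigram tokenisation simply pairs each word with its successor; a base-R sketch of the idea (tidytext’s unnest_tokens handles this for us, including lower-casing):

```r
# Pair each word with the next to form bigrams
words <- strsplit(tolower("Late game is the best game"), " ")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams  # "late game" "game is" "is the" "the best" "best game"
```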

lol_bigrams <- lol_bigrams %>%
      separate(bigram, c("word1", "word2"), sep = " ") 
lol_bigrams #separating them out for easier cleaning
lol_bigrams <- lol_bigrams %>%
      filter(!word1 %in% stop_words$word) %>%
      filter(!word2 %in% stop_words$word)
lol_bigrams #filtering out unwanted words
lol_bigrams <- lol_bigrams %>%
            filter(!is.na(word1)) %>% 
            filter(!is.na(word2))
lol_bigrams #removing null values
lol_bigrams <- lol_bigrams %>%
            unite(bigram, word1, word2, sep=" ")
lol_bigrams #joining the words back together

This code splits the bigram to remove any stop words, usernames and null values before joining the bigrams back together.

lol_bigram_counts <- lol_bigrams %>% count(bigram, sort=TRUE)
lol_bigram_counts #counting and sorting bigrams
lol_bigram_counts %>% 
  filter(str_detect(lol_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> lol_bigram_counts
lol_bigram_counts #removing any bigrams with numbers

Here we remove any bigrams that contain numbers, as they don’t provide any useful information.

lol_filtered_bigrams <- lol_bigram_counts %>%
                  filter(n >= 4)
lol_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more

This section filters the bigrams, keeping only those with a frequency of 4 or more. A threshold of 4 was chosen as a compromise between how many bigrams could be visualised and how much information we could get out of the graphical representation.

lol_separated_bigrams <- lol_filtered_bigrams %>% 
  select("bigram") %>%
  separate(bigram, c("word1", "word2"), sep = " ") 
lol_separated_bigrams #separating bigrams again, preparing for graphical representation
lol_bigram_graph <- lol_separated_bigrams %>%
                  graph_from_data_frame()
lol_bigram_graph  #creating bigram graph
## IGRAPH 319ffde DN-- 1761 6031 -- 
## + attr: name (v/c)
## + edges from 319ffde (vertex names):
##  [1] lp      ->ffsgive  late    ->game     twin    ->fang     play    ->game    
##  [5] earli   ->game     game    ->mode     play    ->dominion rank    ->queue   
##  [9] play    ->rank     leagu   ->legend   peopl   ->play     poison  ->mage    
## [13] rank    ->dominion summon  ->rift     rank    ->game     dominion->player  
## [17] player  ->base     game    ->play     share   ->account  account ->share   
## [21] twist   ->treelin  lane    ->bulli    win     ->rate     loss    ->prevent 
## [25] mana    ->cost     ap      ->ratio    enemi   ->team     lane    ->phase   
## [29] everi   ->singl    fortun  ->teller   noxious ->blast    everi   ->time    
## + ... omitted several edges
ggraph(lol_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

For this project we used a bigram representation to quickly visualize the most common two-word phrases in this large text dataset. This helped us to get a sense of the collocation of words in the data, which gave important insights into the meaning and context of the text.

Additionally, using a bigram representation allowed us to easily identify any unusual or unexpected combinations of words present in the dataset. This was particularly useful when working with unstructured data, as it allowed us to quickly see some of the relationships between words within the text. By identifying the most common two-word phrases, we were able to see how words collocate in the dataset, which gave us an idea of how they were being used. This helped us to identify patterns and themes in the data that would have been difficult to discern by looking at individual words alone.

In summary, using a bigram representation was a valuable tool for our exploratory data analysis, as it helped us to quickly gain a better understanding of our dataset by identifying patterns and themes, and also helped us to identify any unusual or unexpected combinations of words that might be present in the HTML message data.

World of Warcraft analysis

wow_posts <- posts_df %>% filter(dataset == "WoW")
wow_posts %>% filter(is_bullying == 1) -> bullying_wow
bullying_wow

Here we filter our WoW data and, from that data, select all the messages that have been marked as cyberbullying. This allows us to calculate the percentage of cyberbullying cases in our dataset.

nrow(bullying_wow) / nrow(wow_posts) * 100
## [1] 2.437285

For this data set 2.4% of the posts are labeled as bullying.

wow_tibble <- tibble(txt = wow_posts$html_message)
wow_tibble #transforming the HTML messages into a tibble for an easier workflow

The tibble data type is more flexible than a string, which allows us to perform a larger variety of operations on it - such as filtering.

wow_tibble <- wow_tibble%>% 
    mutate(linenumber = row_number()) %>%
 unnest_tokens(word, txt) %>% anti_join(stop_words)
## Joining, by = "word"
wow_tibble #splitting tibble by words and removing stop words
wow_counts <- wow_tibble %>% count(word, sort=TRUE)
wow_counts #sorting and counting the remaining words

This section separates the tibble by line number and then by word. Stop words such as “it”, “his” and “the” are removed. Then the remaining words are sorted by how frequently they appear in our data.
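A base-R illustration of the stop-word step (the real code uses tidytext’s stop_words lexicon; the tiny list here is illustrative only):

```r
# Drop common function words before counting
tokens <- c("the", "realm", "is", "dead", "again")
demo_stop_words <- c("the", "is", "a", "an", "it", "his")
tokens[!tokens %in% demo_stop_words]  # "realm" "dead" "again"
```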

wordcloud(wow_counts$word, wow_counts$n, max.words = 250,
          min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

  #creating word cloud

We used a word cloud to quickly visualise the most common words in our large text dataset. This helped us to get a sense of the overall topic and context of the data, as well as identify any patterns or themes that might be present.

Additionally, using a word cloud allowed us to easily identify any outliers or unusual words that might be present in the dataset, which could then be further investigated. This was particularly useful when working with unstructured data, as it allowed us to quickly gain insights without having to manually sift through all of the text. Using a word cloud was a valuable tool in our exploratory data analysis and helped us to quickly gain a better understanding of the dataset.

wow_bigrams <- tibble(txt = wow_posts$html_message) %>% 
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
wow_bigrams #split original tibble into two-word bigrams

Here we split the WoW tibble into bigram tokens. This allows us to perform an n-gram analysis and see the most common relationships between the words used.

wow_bigrams <- wow_bigrams %>%
      separate(bigram, c("word1", "word2"), sep = " ") 
wow_bigrams #separating the words for easier cleaning
wow_bigrams <- wow_bigrams %>%
      filter(!word1 %in% stop_words$word) %>%
      filter(!word2 %in% stop_words$word)
wow_bigrams #filtering out unwanted words
wow_bigrams <- wow_bigrams %>%
            filter(!is.na(word1)) %>% 
            filter(!is.na(word2))
wow_bigrams #removing null values
wow_bigrams <- wow_bigrams %>%
            unite(bigram, word1, word2, sep=" ")
wow_bigrams #joining the words back together

This code splits each bigram into its two component words, filters out stop words and null values, and then joins the words back together.

wow_bigram_counts <- wow_bigrams %>% count(bigram, sort=TRUE)
wow_bigram_counts #counting and sorting bigrams

Here we display a sorted list of bigrams by their frequency.

wow_bigram_counts %>% 
  filter(str_detect(wow_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> wow_bigram_counts
wow_bigram_counts #removing any bigrams with numbers

This filter removes any bigrams containing numbers, as they don’t relate to our project.

wow_filtered_bigrams <- wow_bigram_counts %>%
                  filter(n >= 4)
wow_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more

Here we filter the bigrams, keeping only the ones that appear 4 or more times in the dataset. The number 4 was chosen due to how many relationships can be represented at once using the ggraph library.

wow_separated_bigrams <- wow_filtered_bigrams %>% 
  select("bigram") %>%
  separate(bigram, c("word1", "word2"), sep = " ") 
wow_separated_bigrams #separating bigrams again, preparing for graphical representation
wow_bigram_graph <- wow_separated_bigrams %>%
                  graph_from_data_frame()
wow_bigram_graph #creating bigram graph
## IGRAPH 5c99086 DN-- 1898 8398 -- 
## + attr: name (v/c)
## + edges from 5c99086 (vertex names):
##  [1] connect->realm       realm  ->connect     pvp    ->realm      
##  [4] play   ->game        pve    ->realm       mani   ->peopl      
##  [7] classic->server      vanilla->server      rp     ->realm      
## [10] lot    ->peopl       classic->realm       fli    ->mount      
## [13] popul  ->realm       peopl  ->play        argent ->dawn       
## [16] hellfir->hellfir     low    ->popul       confirm->kill       
## [19] play   ->wow         vanilla->realm       max    ->level      
## [22] real   ->life        world  ->warcraft    defia  ->brotherhood
## + ... omitted several edges
ggraph(wow_bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

The bigram representation helped us to identify the most common two-word phrases used in the chat. This allowed us to see the distribution of words and understand how players were communicating with each other. For example, the phrase “commit suicide” appears in the bigram graph; messages containing this phrase were often flagged as cyberbullying.

Step 5: Predictive modelling

Initial setup

train_balanced$is_bullying = as.factor(train_balanced$is_bullying)
test_balanced$is_bullying = as.factor(test_balanced$is_bullying)

lol_train_balanced$is_bullying = as.factor(lol_train_balanced$is_bullying)
lol_test_balanced$is_bullying = as.factor(lol_test_balanced$is_bullying)
wow_train_balanced$is_bullying = as.factor(wow_train_balanced$is_bullying)
wow_test_balanced$is_bullying = as.factor(wow_test_balanced$is_bullying)

set.seed(42)
train_control = trainControl(method = "cv", number = 5)

We are using cross validation to help assess how the model will generalize onto new data and to detect overfitting. The aim here is to have a better estimate of the out-of-sample performance of the model.
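
Conceptually, 5-fold cross-validation splits the training rows into five groups and rotates which group is held out for validation. A minimal base-R sketch of the idea (caret handles this internally via trainControl):

```r
# Conceptual sketch of 5-fold cross-validation row assignment
set.seed(42)
n <- 100                                   # pretend the training set has 100 rows
folds <- sample(rep(1:5, length.out = n))  # randomly assign each row to a fold
for (k in 1:5) {
  held_out <- which(folds == k)            # this fold is the validation set
  # the model is fitted on the other 4 folds and scored on held_out;
  # the 5 scores are then averaged for the out-of-sample estimate
}
table(folds)                               # each fold holds exactly 20 rows
```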

SVM

set.seed(42)

tic()
svm_model = caret::train(is_bullying~., data=train_balanced , method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_toc <- toc(quiet=T)

tic()
svm_wow_model = caret::train(is_bullying~., data=wow_train_balanced, method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_wow_toc <- toc(quiet=T)

tic()
svm_lol_model = caret::train(is_bullying~., data=lol_train_balanced, method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_lol_toc <- toc(quiet=T)

svm_time_taken <- svm_toc$toc - svm_toc$tic
svm_lol_time_taken <- svm_lol_toc$toc - svm_lol_toc$tic
svm_wow_time_taken <- svm_wow_toc$toc - svm_wow_toc$tic

svm_pred_y = predict(svm_model, test_balanced)
svm_wow_pred_y = predict(svm_wow_model, wow_test_balanced)
svm_lol_pred_y = predict(svm_lol_model, lol_test_balanced)

Here we train the model with “is_bullying” as the outcome variable, since this is what we want to predict. The method argument selects the type of model, and trControl supplies our resampling parameters.

SVM (Support Vector Machine) is a model that uses classification algorithms for two-group classification problems. We used an SVM here because our dataset is balanced. We also decided to try a linear kernel, as we believe we are working with largely linearly separable datasets, and linear kernels are common in text classification problems.
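
At prediction time, a linear-kernel SVM reduces to the sign of a linear decision function w·x + b. A minimal sketch of that rule with made-up weights (illustrative only, not our fitted model):

```r
# Linear decision rule sign(w . x + b); weights here are purely illustrative
w <- c(0.8, -0.5)    # hypothetical weights for two text features
b <- 0.1             # hypothetical bias term
classify <- function(x) if (sum(w * x) + b >= 0) "bullying" else "not bullying"
classify(c(1, 0))    # 0.8 + 0.1 >= 0 -> "bullying"
classify(c(0, 1))    # -0.5 + 0.1 < 0 -> "not bullying"
```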

MLP

set.seed(42)

tic()
mlp_model = caret::train(is_bullying~., data=train_balanced , method ="mlp" , trControl = train_control, verbose=FALSE)
mlp_toc <- toc(quiet=T)

tic()
mlp_wow_model = caret::train(is_bullying~., data=wow_train_balanced , method ="mlp" , trControl = train_control, verbose=FALSE)
mlp_wow_toc <- toc(quiet=T)

tic()
mlp_lol_model = caret::train(is_bullying~., data=lol_train_balanced , method = "mlp" , trControl = train_control, verbose=FALSE)
mlp_lol_toc <- toc(quiet=T)

mlp_time_taken <- mlp_toc$toc - mlp_toc$tic
mlp_wow_time_taken <- mlp_wow_toc$toc - mlp_wow_toc$tic
mlp_lol_time_taken <- mlp_lol_toc$toc - mlp_lol_toc$tic

mlp_pred_y = predict(mlp_model, test_balanced)
mlp_lol_pred_y = predict(mlp_lol_model, lol_test_balanced)
mlp_wow_pred_y = predict(mlp_wow_model, wow_test_balanced)

Here we train the model with “is_bullying” as the outcome variable, since this is what we want to predict. The method argument selects the type of model, and trControl supplies our resampling parameters.

MLPs (Multilayer Perceptrons) are very flexible and can be used to learn a general mapping from inputs to outputs. They are well suited to classification prediction problems, which is exactly what we need. An MLP is composed of one or more layers of neurons: data is fed to the input layer (also called the visible layer), one or more hidden layers provide increasing levels of abstraction, and predictions are made at the output layer.
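
A single forward pass through a tiny MLP can be sketched in base R. The weights below are hypothetical fixed values purely for illustration; in practice they are learned during training:

```r
# One forward pass through a toy MLP: 2 inputs -> 2 hidden units -> 1 output
sigmoid <- function(z) 1 / (1 + exp(-z))
W1 <- matrix(c(0.5, -0.4, 0.3, 0.8), nrow = 2)  # hypothetical input-to-hidden weights
b1 <- c(0.1, -0.1)                              # hidden biases
W2 <- c(1.0, -1.0)                              # hypothetical hidden-to-output weights
b2 <- 0.05                                      # output bias
x  <- c(1, 0)                                   # one input example
h  <- sigmoid(as.vector(t(W1) %*% x) + b1)      # hidden layer activations
y  <- sigmoid(sum(W2 * h) + b2)                 # output probability in (0, 1)
y > 0.5                                         # threshold to get a class label
```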

glmnet

set.seed(42)

tic()
glmnet_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_toc <- toc(quiet=T)

tic()
glmnet_wow_model = caret::train(is_bullying~., data=wow_train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_wow_toc <- toc(quiet=T)

tic()
glmnet_lol_model = caret::train(is_bullying~., data=lol_train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_lol_toc <- toc(quiet=T)

glmnet_time_taken <- glmnet_toc$toc - glmnet_toc$tic
glmnet_wow_time_taken <- glmnet_wow_toc$toc - glmnet_wow_toc$tic
glmnet_lol_time_taken <- glmnet_lol_toc$toc - glmnet_lol_toc$tic

glmnet_pred_y = predict(glmnet_model, test_balanced)
glmnet_lol_pred_y = predict(glmnet_lol_model, lol_test_balanced)
glmnet_wow_pred_y = predict(glmnet_wow_model, wow_test_balanced)

We used the glmnet package because it provides extremely efficient procedures for fitting logistic regression with lasso or elastic-net regularization. Since this is a classification problem, logistic regression is appropriate.

When caret trains a model with the “glmnet” method, it fits a generalized linear model (logistic regression in this case) via penalized maximum likelihood.
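
Penalized maximum likelihood here means minimising the negative log-likelihood plus the elastic-net penalty, which blends the ridge and lasso penalties. The penalty term itself is simple to sketch (coefficients below are hypothetical):

```r
# Elastic-net penalty: alpha = 1 gives the lasso, alpha = 0 gives ridge
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
beta <- c(2, -1, 0.5)                               # hypothetical coefficients
elastic_net_penalty(beta, lambda = 0.1, alpha = 1)  # lasso: 0.1 * 3.5   = 0.35
elastic_net_penalty(beta, lambda = 0.1, alpha = 0)  # ridge: 0.1 * 2.625 = 0.2625
```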

Step 6: Hyperparameter Tuning

set.seed(42)
random_train_control_grid = caret::trainControl(method="cv", number=3, search="random")

We decided to go with a random grid search because it simplified hyperparameter tuning and removed potential biases arising from our uneven knowledge of the different models (i.e. we expected a random search to give a more consistent basis for comparing the tuned models).
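
Conceptually, a random search draws candidate hyperparameter values at random rather than walking a fixed grid. A sketch of the sampling idea (the range below is illustrative, not caret's internal one):

```r
# Sketch of random hyperparameter sampling, e.g. an SVM cost parameter
set.seed(42)
n_candidates <- 5
cost <- 10^runif(n_candidates, min = -2, max = 2)  # uniform on a log scale
cost  # five random candidates in [0.01, 100]; each gets cross-validated
```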

SVM Linear kernel

set.seed(42)

tic()
svm_linear_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_tuned_toc = toc(quiet=T)

tic()
svm_linear_wow_tuned = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_wow_tuned_toc = toc(quiet=T)

tic()
svm_linear_lol_tuned = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_lol_tuned_toc = toc(quiet=T)

svm_linear_tuned_time_taken <- svm_linear_tuned_toc$toc - svm_linear_tuned_toc$tic
svm_linear_wow_tuned_time_taken <- svm_linear_wow_tuned_toc$toc - svm_linear_wow_tuned_toc$tic
svm_linear_lol_tuned_time_taken <- svm_linear_lol_tuned_toc$toc - svm_linear_lol_tuned_toc$tic

svm_tuned_pred_y = predict(svm_linear_tuned, test_balanced)
svm_tuned_wow_pred_y = predict(svm_linear_wow_tuned, wow_test_balanced)
svm_tuned_lol_pred_y = predict(svm_linear_lol_tuned, lol_test_balanced)

SVM Polynomial kernel

set.seed(42)

tic()
svm_model_poly = caret::train(is_bullying~ ., data = train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_toc = toc(quiet=T)

tic()
svm_lol_model_poly = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_lol_toc = toc(quiet=T)

tic()
svm_wow_model_poly = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_wow_toc = toc(quiet=T)

svm_poly_tuned_time_taken = svm_poly_toc$toc - svm_poly_toc$tic
svm_lol_poly_tuned_time_taken = svm_poly_lol_toc$toc - svm_poly_lol_toc$tic
svm_wow_poly_tuned_time_taken = svm_poly_wow_toc$toc - svm_poly_wow_toc$tic

svm_poly_pred_y = predict(svm_model_poly, test_balanced)
svm_lol_poly_pred_y = predict(svm_lol_model_poly, lol_test_balanced)
svm_wow_poly_pred_y = predict(svm_wow_model_poly, wow_test_balanced)

SVM RBF Kernel

set.seed(42)

tic()
svm_model_rbf = caret::train(is_bullying~ ., data = train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_rbf_toc = toc(quiet=T)

tic()
svm_lol_model_rbf = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_lol_rbf_toc = toc(quiet=T)

tic()
svm_wow_model_rbf = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_wow_rbf_toc = toc(quiet=T)

# summarizing the results
svm_rbf_tuned_time_taken = svm_rbf_toc$toc - svm_rbf_toc$tic
svm_wow_rbf_tuned_time_taken = svm_wow_rbf_toc$toc - svm_wow_rbf_toc$tic
svm_lol_rbf_tuned_time_taken = svm_lol_rbf_toc$toc - svm_lol_rbf_toc$tic

svm_rbf_pred_y = predict(svm_model_rbf, test_balanced)
svm_wow_rbf_pred_y = predict(svm_wow_model_rbf, wow_test_balanced)
svm_lol_rbf_pred_y = predict(svm_lol_model_rbf, lol_test_balanced)

MLP Tuned

set.seed(42)

tic()
mlp_model_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_toc = toc(quiet=T)

tic()
mlp_lol_model_tuned = caret::train(is_bullying~ ., data = lol_train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_lol_toc = toc(quiet=T)

tic()
mlp_wow_model_tuned = caret::train(is_bullying~ ., data = wow_train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_wow_toc = toc(quiet=T)

mlp_tuned_time_taken = mlp_toc$toc - mlp_toc$tic
mlp_lol_tuned_time_taken = mlp_lol_toc$toc - mlp_lol_toc$tic
mlp_wow_tuned_time_taken = mlp_wow_toc$toc - mlp_wow_toc$tic

mlp_tuned_pred_y = predict(mlp_model_tuned, test_balanced)
mlp_lol_tuned_pred_y = predict(mlp_lol_model_tuned, lol_test_balanced)
mlp_wow_tuned_pred_y = predict(mlp_wow_model_tuned, wow_test_balanced)

GLMNET Tuned

set.seed(42)

tic()
glm_tuned_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_tuned_toc <- toc(quiet=T)

tic()
glm_lol_tuned_model = caret::train(is_bullying~., data=lol_train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_lol_tuned_toc <- toc(quiet=T)

tic()
glm_wow_tuned_model = caret::train(is_bullying~., data=wow_train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_wow_tuned_toc <- toc(quiet=T)

glm_tuned_time_taken <- glm_tuned_toc$toc - glm_tuned_toc$tic
glm_lol_tuned_time_taken <- glm_lol_tuned_toc$toc - glm_lol_tuned_toc$tic
glm_wow_tuned_time_taken <- glm_wow_tuned_toc$toc - glm_wow_tuned_toc$tic

glm_tuned_pred_y = predict(glm_tuned_model, test_balanced)
glm_lol_tuned_pred_y = predict(glm_lol_tuned_model, lol_test_balanced)
glm_wow_tuned_pred_y = predict(glm_wow_tuned_model, wow_test_balanced)

Step 7: Evaluation

For this evaluation, we wanted a comprehensive comparison of the models using key metrics such as recall, precision, F1 score and the time taken to train each model, to give a better idea of which models perform best. We consider the F1 score the most important metric: a cyberbullying classifier should minimise both false positives (people flagged who haven’t actually bullied anybody) and false negatives (bullies that haven’t been identified). The F1 score acts as a trade-off between precision and recall, which is exactly what we need. It is also robust against uneven class distributions, unlike accuracy, which gives us flexibility in how we tweak pre-processing.
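
The trade-off is visible in the formula: F1 is the harmonic mean of precision and recall, so a weakness in either one drags the score down:

```r
# F1 score: harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1_score(0.9, 0.9)  # balanced model: 0.9
f1_score(0.9, 0.1)  # poor recall is punished despite high precision: 0.18
```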

We have utilised the following to create an effective evaluation:

  1. Created confusion matrices that illustrate where misclassifications are taking place
  2. Plotted ROC curves to display the model performance
  3. Created a table of metrics showing the recall, precision, F1 score and the time taken to train each model. We included training time because it helps identify the production cost of each model (the models would take much longer to train on big batches of data).

Creating confusion matrices (combined dataset)

svm_confusion_matrix <- caret::confusionMatrix(svm_pred_y, test_balanced$is_bullying)
mlp_confusion_matrix <- caret::confusionMatrix(mlp_pred_y, test_balanced$is_bullying)
glm_confusion_matrix <- caret::confusionMatrix(glmnet_pred_y, test_balanced$is_bullying)
svm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124  45
##          1  28 113
##                                           
##                Accuracy : 0.7645          
##                  95% CI : (0.7133, 0.8106)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5299          
##                                           
##  Mcnemar's Test P-Value : 0.06112         
##                                           
##             Sensitivity : 0.8158          
##             Specificity : 0.7152          
##          Pos Pred Value : 0.7337          
##          Neg Pred Value : 0.8014          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.5452          
##       Balanced Accuracy : 0.7655          
##                                           
##        'Positive' Class : 0               
## 
mlp_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 117  43
##          1  35 115
##                                           
##                Accuracy : 0.7484          
##                  95% CI : (0.6962, 0.7957)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4971          
##                                           
##  Mcnemar's Test P-Value : 0.428           
##                                           
##             Sensitivity : 0.7697          
##             Specificity : 0.7278          
##          Pos Pred Value : 0.7312          
##          Neg Pred Value : 0.7667          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3774          
##    Detection Prevalence : 0.5161          
##       Balanced Accuracy : 0.7488          
##                                           
##        'Positive' Class : 0               
## 
glm_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 133  57
##          1  19 101
##                                          
##                Accuracy : 0.7548         
##                  95% CI : (0.703, 0.8017)
##     No Information Rate : 0.5097         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5118         
##                                          
##  Mcnemar's Test P-Value : 2.194e-05      
##                                          
##             Sensitivity : 0.8750         
##             Specificity : 0.6392         
##          Pos Pred Value : 0.7000         
##          Neg Pred Value : 0.8417         
##              Prevalence : 0.4903         
##          Detection Rate : 0.4290         
##    Detection Prevalence : 0.6129         
##       Balanced Accuracy : 0.7571         
##                                          
##        'Positive' Class : 0              
## 

When examining these confusion matrices, the number of false positives (i.e. posts falsely classified as cyberbullying) appears to be the differentiating factor for F1 score, as the glmnet model has the fewest false positives.
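
The metrics behind this comparison can be reproduced by hand from the glmnet matrix above (positive class 0, so rows are predictions and columns are the reference labels):

```r
# Raw counts from the glmnet confusion matrix (positive class = 0)
tp <- 133  # predicted 0, actually 0
fp <- 57   # predicted 0, actually 1
fn <- 19   # predicted 1, actually 0
precision <- tp / (tp + fp)  # Pos Pred Value: 133/190 = 0.70
recall    <- tp / (tp + fn)  # Sensitivity:    133/152 = 0.875
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)
```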

svm_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_pred_y, test_balanced$is_bullying)
svm_poly_cf_matrix <- caret::confusionMatrix(svm_poly_pred_y, test_balanced$is_bullying)
svm_rbf_cf_matrix <- caret::confusionMatrix(svm_rbf_pred_y, test_balanced$is_bullying)

mlp_tuned_cf_matrix <- caret::confusionMatrix(mlp_tuned_pred_y, test_balanced$is_bullying)
glm_tuned_cf_matrix <- caret::confusionMatrix(glm_tuned_pred_y, test_balanced$is_bullying)

Here we create confusion matrices to get values such as F1 score, precision and recalls.

svm_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 126  46
##          1  26 112
##                                           
##                Accuracy : 0.7677          
##                  95% CI : (0.7167, 0.8136)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5365          
##                                           
##  Mcnemar's Test P-Value : 0.02514         
##                                           
##             Sensitivity : 0.8289          
##             Specificity : 0.7089          
##          Pos Pred Value : 0.7326          
##          Neg Pred Value : 0.8116          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4065          
##    Detection Prevalence : 0.5548          
##       Balanced Accuracy : 0.7689          
##                                           
##        'Positive' Class : 0               
## 
svm_poly_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 133  47
##          1  19 111
##                                           
##                Accuracy : 0.7871          
##                  95% CI : (0.7373, 0.8313)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5755          
##                                           
##  Mcnemar's Test P-Value : 0.000889        
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.7025          
##          Pos Pred Value : 0.7389          
##          Neg Pred Value : 0.8538          
##              Prevalence : 0.4903          
##          Detection Rate : 0.4290          
##    Detection Prevalence : 0.5806          
##       Balanced Accuracy : 0.7888          
##                                           
##        'Positive' Class : 0               
## 
svm_rbf_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 121  47
##          1  31 111
##                                           
##                Accuracy : 0.7484          
##                  95% CI : (0.6962, 0.7957)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.4976          
##                                           
##  Mcnemar's Test P-Value : 0.08943         
##                                           
##             Sensitivity : 0.7961          
##             Specificity : 0.7025          
##          Pos Pred Value : 0.7202          
##          Neg Pred Value : 0.7817          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3903          
##    Detection Prevalence : 0.5419          
##       Balanced Accuracy : 0.7493          
##                                           
##        'Positive' Class : 0               
## 
mlp_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 112  41
##          1  40 117
##                                           
##                Accuracy : 0.7387          
##                  95% CI : (0.6861, 0.7867)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4773          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7368          
##             Specificity : 0.7405          
##          Pos Pred Value : 0.7320          
##          Neg Pred Value : 0.7452          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3613          
##    Detection Prevalence : 0.4935          
##       Balanced Accuracy : 0.7387          
##                                           
##        'Positive' Class : 0               
## 
glm_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 122  45
##          1  30 113
##                                           
##                Accuracy : 0.7581          
##                  95% CI : (0.7064, 0.8047)
##     No Information Rate : 0.5097          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5169          
##                                           
##  Mcnemar's Test P-Value : 0.106           
##                                           
##             Sensitivity : 0.8026          
##             Specificity : 0.7152          
##          Pos Pred Value : 0.7305          
##          Neg Pred Value : 0.7902          
##              Prevalence : 0.4903          
##          Detection Rate : 0.3935          
##    Detection Prevalence : 0.5387          
##       Balanced Accuracy : 0.7589          
##                                           
##        'Positive' Class : 0               
## 

Creating confusion matrices (WoW dataset)

svm_wow_confusion_matrix <- caret::confusionMatrix(svm_wow_pred_y, wow_test_balanced$is_bullying)
mlp_wow_confusion_matrix <- caret::confusionMatrix(mlp_wow_pred_y, wow_test_balanced$is_bullying)
glm_wow_confusion_matrix <- caret::confusionMatrix(glmnet_wow_pred_y, wow_test_balanced$is_bullying)
svm_wow_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37 21
##          1  8 54
##                                           
##                Accuracy : 0.7583          
##                  95% CI : (0.6717, 0.8318)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.00133         
##                                           
##                   Kappa : 0.5126          
##                                           
##  Mcnemar's Test P-Value : 0.02586         
##                                           
##             Sensitivity : 0.8222          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.6379          
##          Neg Pred Value : 0.8710          
##              Prevalence : 0.3750          
##          Detection Rate : 0.3083          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.7711          
##                                           
##        'Positive' Class : 0               
## 
mlp_wow_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35 22
##          1 10 53
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6449, 0.8099)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.008121        
##                                           
##                   Kappa : 0.4599          
##                                           
##  Mcnemar's Test P-Value : 0.051830        
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.7067          
##          Pos Pred Value : 0.6140          
##          Neg Pred Value : 0.8413          
##              Prevalence : 0.3750          
##          Detection Rate : 0.2917          
##    Detection Prevalence : 0.4750          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 
glm_wow_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 36
##          1  2 39
##                                           
##                Accuracy : 0.6833          
##                  95% CI : (0.5922, 0.7652)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.1093          
##                                           
##                   Kappa : 0.4131          
##                                           
##  Mcnemar's Test P-Value : 8.636e-08       
##                                           
##             Sensitivity : 0.9556          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.5443          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.3750          
##          Detection Rate : 0.3583          
##    Detection Prevalence : 0.6583          
##       Balanced Accuracy : 0.7378          
##                                           
##        'Positive' Class : 0               
## 

When examining these confusion matrices, the number of false positives (i.e. posts falsely classified as cyberbullying) appears to be the differentiating factor for F1 score, as the glmnet model has the fewest false positives.

svm_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_wow_pred_y, wow_test_balanced$is_bullying)
svm_poly_cf_matrix <- caret::confusionMatrix(svm_wow_poly_pred_y, wow_test_balanced$is_bullying)
svm_rbf_cf_matrix <- caret::confusionMatrix(svm_wow_rbf_pred_y, wow_test_balanced$is_bullying)

mlp_tuned_cf_matrix <- caret::confusionMatrix(mlp_wow_tuned_pred_y, wow_test_balanced$is_bullying)
glm_tuned_cf_matrix <- caret::confusionMatrix(glm_wow_tuned_pred_y, wow_test_balanced$is_bullying)

Here we create confusion matrices to get values such as F1 score, precision and recalls.

svm_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37 19
##          1  8 56
##                                           
##                Accuracy : 0.775           
##                  95% CI : (0.6898, 0.8462)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.000323        
##                                           
##                   Kappa : 0.5424          
##                                           
##  Mcnemar's Test P-Value : 0.054292        
##                                           
##             Sensitivity : 0.8222          
##             Specificity : 0.7467          
##          Pos Pred Value : 0.6607          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.3750          
##          Detection Rate : 0.3083          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.7844          
##                                           
##        'Positive' Class : 0               
## 
svm_poly_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37 21
##          1  8 54
##                                           
##                Accuracy : 0.7583          
##                  95% CI : (0.6717, 0.8318)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.00133         
##                                           
##                   Kappa : 0.5126          
##                                           
##  Mcnemar's Test P-Value : 0.02586         
##                                           
##             Sensitivity : 0.8222          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.6379          
##          Neg Pred Value : 0.8710          
##              Prevalence : 0.3750          
##          Detection Rate : 0.3083          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.7711          
##                                           
##        'Positive' Class : 0               
## 
svm_rbf_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 38 21
##          1  7 54
##                                          
##                Accuracy : 0.7667         
##                  95% CI : (0.6807, 0.839)
##     No Information Rate : 0.625          
##     P-Value [Acc > NIR] : 0.00067        
##                                          
##                   Kappa : 0.5314         
##                                          
##  Mcnemar's Test P-Value : 0.01402        
##                                          
##             Sensitivity : 0.8444         
##             Specificity : 0.7200         
##          Pos Pred Value : 0.6441         
##          Neg Pred Value : 0.8852         
##              Prevalence : 0.3750         
##          Detection Rate : 0.3167         
##    Detection Prevalence : 0.4917         
##       Balanced Accuracy : 0.7822         
##                                          
##        'Positive' Class : 0              
## 
mlp_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34 19
##          1 11 56
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6627, 0.8245)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.002533        
##                                           
##                   Kappa : 0.485           
##                                           
##  Mcnemar's Test P-Value : 0.201243        
##                                           
##             Sensitivity : 0.7556          
##             Specificity : 0.7467          
##          Pos Pred Value : 0.6415          
##          Neg Pred Value : 0.8358          
##              Prevalence : 0.3750          
##          Detection Rate : 0.2833          
##    Detection Prevalence : 0.4417          
##       Balanced Accuracy : 0.7511          
##                                           
##        'Positive' Class : 0               
## 

It would appear that tuning MLP parameters through random grid search has actually decreased the accuracy of the model.
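Whether such an accuracy drop is meaningful can be checked with a paired McNemar's test on the two models' predictions (caret already prints a McNemar p-value per matrix, but that tests a single model's own disagreement pattern). A minimal base-R sketch with stand-in vectors — `truth`, `pred_untuned` and `pred_tuned` are hypothetical, not the report's actual objects:

```r
# Hypothetical stand-ins for the shared test labels and two models' predictions.
truth        <- factor(c(0, 0, 0, 1, 1, 1, 0, 1, 0, 1))
pred_untuned <- factor(c(0, 1, 0, 1, 1, 0, 0, 1, 1, 1))
pred_tuned   <- factor(c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1))

correct_untuned <- pred_untuned == truth
correct_tuned   <- pred_tuned   == truth

# 2x2 agreement table of per-observation correctness; McNemar's test
# is driven by the off-diagonal (discordant) cells.
paired_tab <- table(correct_untuned, correct_tuned)
mcnemar.test(paired_tab)
```

A large p-value here would suggest the tuned and untuned models do not differ significantly on this test set.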

glm_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42 25
##          1  3 50
##                                          
##                Accuracy : 0.7667         
##                  95% CI : (0.6807, 0.839)
##     No Information Rate : 0.625          
##     P-Value [Acc > NIR] : 0.00067        
##                                          
##                   Kappa : 0.5466         
##                                          
##  Mcnemar's Test P-Value : 7.229e-05      
##                                          
##             Sensitivity : 0.9333         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.6269         
##          Neg Pred Value : 0.9434         
##              Prevalence : 0.3750         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.5583         
##       Balanced Accuracy : 0.8000         
##                                          
##        'Positive' Class : 0              
## 

Creating confusion matrices (LoL dataset)

svm_lol_confusion_matrix <- caret::confusionMatrix(svm_lol_pred_y, lol_test_balanced$is_bullying)
mlp_lol_confusion_matrix <- caret::confusionMatrix(mlp_lol_pred_y, lol_test_balanced$is_bullying)
glm_lol_confusion_matrix <- caret::confusionMatrix(glmnet_lol_pred_y, lol_test_balanced$is_bullying)
svm_lol_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 70 36
##          1 16 68
##                                           
##                Accuracy : 0.7263          
##                  95% CI : (0.6571, 0.7884)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 2.986e-07       
##                                           
##                   Kappa : 0.4586          
##                                           
##  Mcnemar's Test P-Value : 0.008418        
##                                           
##             Sensitivity : 0.8140          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.6604          
##          Neg Pred Value : 0.8095          
##              Prevalence : 0.4526          
##          Detection Rate : 0.3684          
##    Detection Prevalence : 0.5579          
##       Balanced Accuracy : 0.7339          
##                                           
##        'Positive' Class : 0               
## 
mlp_lol_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 60 31
##          1 26 73
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6294, 0.7642)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 1.195e-05       
##                                           
##                   Kappa : 0.3976          
##                                           
##  Mcnemar's Test P-Value : 0.5962          
##                                           
##             Sensitivity : 0.6977          
##             Specificity : 0.7019          
##          Pos Pred Value : 0.6593          
##          Neg Pred Value : 0.7374          
##              Prevalence : 0.4526          
##          Detection Rate : 0.3158          
##    Detection Prevalence : 0.4789          
##       Balanced Accuracy : 0.6998          
##                                           
##        'Positive' Class : 0               
## 
glm_lol_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 77 39
##          1  9 65
##                                           
##                Accuracy : 0.7474          
##                  95% CI : (0.6794, 0.8075)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 9.858e-09       
##                                           
##                   Kappa : 0.5051          
##                                           
##  Mcnemar's Test P-Value : 2.842e-05       
##                                           
##             Sensitivity : 0.8953          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.6638          
##          Neg Pred Value : 0.8784          
##              Prevalence : 0.4526          
##          Detection Rate : 0.4053          
##    Detection Prevalence : 0.6105          
##       Balanced Accuracy : 0.7602          
##                                           
##        'Positive' Class : 0               
## 

When examining these confusion matrices, the number of false positives (i.e. posts falsely classified as cyberbullying) appears to be the differentiating factor for F1 score, as the glmnet model produces the fewest false positives (9, versus 16 for the SVM and 26 for the MLP).
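As a sanity check, the printed statistics can be reproduced from the raw counts. A minimal base-R sketch using the glmnet LoL matrix above — note that caret reports class 0 as the positive class here, so precision corresponds to "Pos Pred Value" and recall to "Sensitivity":

```r
# Counts from the glmnet LoL confusion matrix, treating class 0 as
# positive to match caret's output above.
TP <- 77  # predicted 0, actual 0
FP <- 39  # predicted 0, actual 1
FN <- 9   # predicted 1, actual 0

precision <- TP / (TP + FP)  # caret's "Pos Pred Value"
recall    <- TP / (TP + FN)  # caret's "Sensitivity"
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 4)
# precision = 0.6638 and recall = 0.8953, matching the printed values
```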

svm_lol_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_lol_pred_y, lol_test_balanced$is_bullying)
svm_lol_poly_cf_matrix <- caret::confusionMatrix(svm_lol_poly_pred_y, lol_test_balanced$is_bullying)
svm_lol_rbf_cf_matrix <- caret::confusionMatrix(svm_lol_rbf_pred_y, lol_test_balanced$is_bullying)

mlp_lol_tuned_cf_matrix <- caret::confusionMatrix(mlp_lol_tuned_pred_y, lol_test_balanced$is_bullying)
glm_lol_tuned_cf_matrix <- caret::confusionMatrix(glm_lol_tuned_pred_y, lol_test_balanced$is_bullying)

Here we create confusion matrices to obtain values such as F1 score, precision and recall.

svm_lol_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 63 40
##          1 23 64
##                                           
##                Accuracy : 0.6684          
##                  95% CI : (0.5966, 0.7349)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 0.0004544       
##                                           
##                   Kappa : 0.3421          
##                                           
##  Mcnemar's Test P-Value : 0.0438198       
##                                           
##             Sensitivity : 0.7326          
##             Specificity : 0.6154          
##          Pos Pred Value : 0.6117          
##          Neg Pred Value : 0.7356          
##              Prevalence : 0.4526          
##          Detection Rate : 0.3316          
##    Detection Prevalence : 0.5421          
##       Balanced Accuracy : 0.6740          
##                                           
##        'Positive' Class : 0               
## 
svm_lol_poly_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 54 47
##          1 32 57
##                                           
##                Accuracy : 0.5842          
##                  95% CI : (0.5106, 0.6551)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 0.1718          
##                                           
##                   Kappa : 0.1734          
##                                           
##  Mcnemar's Test P-Value : 0.1152          
##                                           
##             Sensitivity : 0.6279          
##             Specificity : 0.5481          
##          Pos Pred Value : 0.5347          
##          Neg Pred Value : 0.6404          
##              Prevalence : 0.4526          
##          Detection Rate : 0.2842          
##    Detection Prevalence : 0.5316          
##       Balanced Accuracy : 0.5880          
##                                           
##        'Positive' Class : 0               
## 
svm_lol_rbf_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34 21
##          1 52 83
##                                           
##                Accuracy : 0.6158          
##                  95% CI : (0.5426, 0.6853)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 0.033684        
##                                           
##                   Kappa : 0.1997          
##                                           
##  Mcnemar's Test P-Value : 0.000446        
##                                           
##             Sensitivity : 0.3953          
##             Specificity : 0.7981          
##          Pos Pred Value : 0.6182          
##          Neg Pred Value : 0.6148          
##              Prevalence : 0.4526          
##          Detection Rate : 0.1789          
##    Detection Prevalence : 0.2895          
##       Balanced Accuracy : 0.5967          
##                                           
##        'Positive' Class : 0               
## 
mlp_lol_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 66 30
##          1 20 74
##                                           
##                Accuracy : 0.7368          
##                  95% CI : (0.6682, 0.7979)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 5.718e-08       
##                                           
##                   Kappa : 0.4742          
##                                           
##  Mcnemar's Test P-Value : 0.2031          
##                                           
##             Sensitivity : 0.7674          
##             Specificity : 0.7115          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.7872          
##              Prevalence : 0.4526          
##          Detection Rate : 0.3474          
##    Detection Prevalence : 0.5053          
##       Balanced Accuracy : 0.7395          
##                                           
##        'Positive' Class : 0               
## 

In contrast to the combined dataset, tuning the MLP parameters through random grid search has increased the model's accuracy here (0.7368 versus 0.7 untuned).

glm_lol_tuned_cf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 71 34
##          1 15 70
##                                           
##                Accuracy : 0.7421          
##                  95% CI : (0.6738, 0.8027)
##     No Information Rate : 0.5474          
##     P-Value [Acc > NIR] : 2.406e-08       
##                                           
##                   Kappa : 0.4893          
##                                           
##  Mcnemar's Test P-Value : 0.01013         
##                                           
##             Sensitivity : 0.8256          
##             Specificity : 0.6731          
##          Pos Pred Value : 0.6762          
##          Neg Pred Value : 0.8235          
##              Prevalence : 0.4526          
##          Detection Rate : 0.3737          
##    Detection Prevalence : 0.5526          
##       Balanced Accuracy : 0.7493          
##                                           
##        'Positive' Class : 0               
## 

ROC Curves (combined dataset)

plot_roc_curves <- function(model_1_pred_y, model_2_pred_y, model_3_pred_y, model_names, test_data) {
  # Convert the factor predictions back to their numeric labels;
  # as.numeric() on a factor alone would return the level indices instead.
  model_1_pred_y <- as.numeric(levels(model_1_pred_y))[model_1_pred_y]
  model_2_pred_y <- as.numeric(levels(model_2_pred_y))[model_2_pred_y]
  model_3_pred_y <- as.numeric(levels(model_3_pred_y))[model_3_pred_y]

  par(pty="s")  # square plotting region so the ROC curves are not distorted
  model_1_roc <- roc(test_data$is_bullying~model_1_pred_y, plot=TRUE, print.auc=TRUE, col="red", lwd=4, legacy.axes=TRUE, main="ROC Curves")
  model_2_roc <- roc(test_data$is_bullying~model_2_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.4, col="blue", lwd=4, add=TRUE)
  model_3_roc <- roc(test_data$is_bullying~model_3_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.6, col="green", lwd=4, add=TRUE)

  legend("bottomright", legend=model_names, col=c("red", "blue", "green"), lwd=4)
}
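The `as.numeric(levels(f))[f]` idiom in the function above matters: calling `as.numeric()` directly on a factor returns the internal level indices rather than the labels. A minimal illustration:

```r
# Factor with labels "0" and "1", as produced by the classifiers above.
f <- factor(c("0", "1", "1", "0"))

as.numeric(f)             # 1 2 2 1 -- the level indices, NOT the labels
as.numeric(levels(f))[f]  # 0 1 1 0 -- the actual numeric labels
```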

Non-tuned models

plot_roc_curves(svm_pred_y, mlp_pred_y, glmnet_pred_y, c("SVM", "MLP", "GLMNET"), test_balanced)

Comparing tuned SVM models

plot_roc_curves(svm_tuned_pred_y, svm_poly_pred_y, svm_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), test_balanced)

It would appear that the polynomial kernel is most suitable for bullying classification. We will compare this SVM model that uses the polynomial kernel with other models.

Tuned models

plot_roc_curves(svm_poly_pred_y, mlp_tuned_pred_y, glm_tuned_pred_y, c("SVM Poly", "MLP Tuned", "GLMNET"), test_balanced)

ROC Curves (WoW dataset)

Non-tuned models

plot_roc_curves(svm_wow_pred_y, mlp_wow_pred_y, glmnet_wow_pred_y, c("SVM", "MLP", "GLMNET"), wow_test_balanced)

It would appear that the glmnet model is very effective when used with the world of warcraft dataset as it has a much greater specificity than the other models. However, MLP has a slightly higher AUC.

Comparing tuned SVM models

plot_roc_curves(svm_tuned_wow_pred_y, svm_wow_poly_pred_y, svm_wow_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), wow_test_balanced)

It would appear that the RBF model has the greatest AUC. We can compare this model with other models.

Tuned models

plot_roc_curves(svm_wow_rbf_pred_y, mlp_wow_tuned_pred_y, glm_wow_tuned_pred_y, c("SVM RBF", "MLP Tuned", "GLMNET"), wow_test_balanced)

ROC Curves (LoL dataset)

Non-tuned models

plot_roc_curves(svm_lol_pred_y, mlp_lol_pred_y, glmnet_lol_pred_y, c("SVM", "MLP", "GLMNET"), lol_test_balanced)

It appears that glmnet consistently has good specificity when non-tuned, giving it a decent AUC score.

Comparing tuned SVM models

plot_roc_curves(svm_tuned_lol_pred_y, svm_lol_poly_pred_y, svm_lol_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), lol_test_balanced)

It would appear that the linear kernel is most suitable for bullying classification. We will compare this SVM model that uses the linear kernel with other models.

Tuned models

plot_roc_curves(svm_tuned_lol_pred_y, mlp_lol_tuned_pred_y, glm_lol_tuned_pred_y, c("SVM Linear", "MLP Tuned", "GLMNET"), lol_test_balanced)

It would appear that MLP has the greatest AUC score when tuned.

Table of metrics

For each model, we calculate our chosen evaluation metrics (time taken to train, precision, sensitivity, F1 score and AUC) and assemble them into a table of metrics for comparison. We record training time to gauge how practical each model would be if it were retrained continuously on new data: while training on the sample data is quick, it would take much longer on the full datasets.

create_table <- function(prediction_list, time_to_train_c, test_data) {
  Model_Name = names(prediction_list)
  Precision = c()
  Recall = c()
  F1_Score = c()
  AUC = c()
  Time_to_Train_secs = time_to_train_c
  
  # caret's precision(), recall() and F_meas() operate on factors directly;
  # pROC's auc() needs a numeric predictor, hence the as.integer() conversion.
  for (prediction_y in prediction_list) {
    Precision <- append(Precision, precision(prediction_y, test_data$is_bullying))
    Recall <- append(Recall, recall(prediction_y, test_data$is_bullying))
    F1_Score <- append(F1_Score, F_meas(prediction_y, test_data$is_bullying))
    AUC <- append(AUC, auc(test_data$is_bullying, as.integer(prediction_y)))
  }
  
  results <- data.frame(Model_Name, Precision, Recall, F1_Score, AUC, Time_to_Train_secs)
  return(results)
}

Combined dataset table

prediction_list = list(svm_linear = svm_tuned_pred_y, svm_poly = svm_poly_pred_y, svm_rbf = svm_rbf_pred_y, mlp = mlp_tuned_pred_y, glmnet = glm_tuned_pred_y)
time_to_train = c(svm_time_taken, svm_poly_tuned_time_taken, svm_rbf_tuned_time_taken, mlp_time_taken, glmnet_time_taken)
results = create_table(prediction_list, time_to_train, test_balanced)
results

From these results, we can see that the svm_poly model had the greatest F1 score and the second-greatest AUC, where it appears roughly as good as the glmnet model. Choosing between the two may come down to prioritising either precision or recall. Both models are also relatively quick to train, which can reduce production costs.

We can also see that the mlp model performed the worst: it took a long time to train on a relatively small sample of data (almost two minutes) and it has the worst F1 and AUC scores. However, it does have the highest precision, at the cost of relatively poor recall.
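If the trade-off between precision and recall were to be weighted explicitly rather than balanced, an F-beta score could be computed in place of F1. A sketch with hypothetical values, not the table's actual metrics:

```r
# F-beta generalises F1: beta > 1 weights recall more heavily,
# beta < 1 weights precision more heavily, beta = 1 recovers F1.
f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

f_beta(0.64, 0.82)            # beta = 1: ordinary F1
f_beta(0.64, 0.82, beta = 2)  # favours recall (catching more bullying)
```

For a moderation use case, missing genuine bullying may be costlier than a false flag, which would argue for beta > 1.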

WoW dataset table

prediction_list = list(svm_linear = svm_tuned_wow_pred_y, svm_poly = svm_wow_poly_pred_y, svm_rbf = svm_wow_rbf_pred_y, mlp = mlp_wow_tuned_pred_y, glmnet = glm_wow_tuned_pred_y)
time_to_train = c(svm_wow_time_taken, svm_wow_poly_tuned_time_taken, svm_wow_rbf_tuned_time_taken, mlp_wow_time_taken, glmnet_wow_time_taken)
wow_results = create_table(prediction_list, time_to_train, wow_test_balanced)
wow_results


LoL Dataset Table

prediction_list = list(svm_linear = svm_tuned_lol_pred_y, svm_poly = svm_lol_poly_pred_y, svm_rbf = svm_lol_rbf_pred_y, mlp = mlp_lol_tuned_pred_y, glmnet = glm_lol_tuned_pred_y)
time_to_train = c(svm_lol_time_taken, svm_lol_poly_tuned_time_taken, svm_lol_rbf_tuned_time_taken, mlp_lol_time_taken, glmnet_lol_time_taken)
lol_results = create_table(prediction_list, time_to_train, lol_test_balanced)
lol_results

From these results, we can see that the svm_poly model has the greatest F1 score, whilst svm_linear has a slightly higher AUC. However, svm_poly has the worst precision, whereas glmnet has the highest. We could choose the glmnet model as its precision and recall are not as far apart from each other.

Output results

write.csv(lol_results, "Data/lol_results.csv")
write.csv(wow_results, "Data/wow_results.csv")
write.csv(results, "Data/results.csv")

We output the table results to CSV files to make it easy to share results with other group members.
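One detail worth noting: `write.csv()` adds a row-name column by default, which reappears as an unnamed first column on re-import. A minimal round-trip sketch with a stand-in table (`demo_results` is hypothetical, not the report's actual metrics):

```r
# Stand-in results table for illustration only.
demo_results <- data.frame(Model_Name = c("svm_poly", "glmnet"),
                           F1_Score   = c(0.76, 0.74))
path <- file.path(tempdir(), "demo_results.csv")

# row.names = FALSE keeps the file free of the extra row-name column.
write.csv(demo_results, path, row.names = FALSE)
reloaded <- read.csv(path)

identical(reloaded$Model_Name, demo_results$Model_Name)
```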

Individual contributions

Thomas Newton

I was responsible for predictive modelling and hyperparameter tuning. I also created the problem statements and references.

Leon Harper

I was responsible for importing the data (converting it from an SQL dump file into corresponding CSV files), cleaning the data and evaluating the performance of relevant models. I helped manage the project through the use of a Trello board and helped team members who were struggling.

Michal Jedruszczak

In this project I was responsible for the preliminary EDA and the final EDA. I also created the contents section and made sure that all of the references were in MMU Harvard format.

References

Anti-Defamation League (2022) Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021. [Online] [Accessed on 19th November 2022] https://www.adl.org/resources/report/hate-no-game-harassment-and-positive-social-experiences-online-games-2021

Bretschneider, U. and Peters, R. (2016) 'Detecting cyberbullying in online communities.' Research Papers, Paper 61. [Online] http://aisel.aisnet.org/ecis2016_rp/61